Add mixed precision training to TorchEngine #1322
Conversation
Uses `torch_amp_options` as config dict with a `"dtype"` option. Adds a GradScaler to the engine, and applies autocast and the scaler during training if AMP is enabled.
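For context, a minimal sketch of the standard PyTorch AMP training-step pattern this describes (autocast plus GradScaler); the model, optimizer, and loss names below are placeholders, not the actual TorchEngine code:

```python
# Minimal sketch of the autocast + GradScaler pattern; placeholder names,
# not the actual RETURNN TorchEngine implementation.
import torch

amp_dtype = torch.float16  # would come from the "dtype" entry of torch_amp_options
scaler = torch.cuda.amp.GradScaler()  # loss scaling, needed for float16

def train_step(model, optimizer, loss_fn, data, targets):
    optimizer.zero_grad()
    # Ops inside this context run in amp_dtype where safe; params stay float32.
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        outputs = model(data)
        loss = loss_fn(outputs, targets)
    # Scale the loss to avoid float16 gradient underflow; unscale before the step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```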
Looks fine.
I wonder, in what cases would you want the model params to also use the same dtype? I have read that when using bfloat16, you usually always want the params to also be stored in bfloat16. With float16 AMP training, is it normal to keep the params in float32? What about TensorFloat32? So I wonder if this is something which should also be handled automatically via `torch_amp_options`, or whether that should be a separate option. And I wonder how people usually do it.
I just merged this as an initial version now. Can you comment on my questions?
Another question: was it intentional that you allowed the user to not specify `dtype`?
So yes, you should not do anything to the model explicitly because autocast is handling that.
No, this is a mistake, it should be given. I do not know what the behavior is then, but likely not what is intended.
Autocast automatically casts the inputs to certain PyTorch ops. The parameters are not changed, they are just cast automatically for those ops. But this is not really my question. My question is: wouldn't it make more sense to directly have the parameters in float16?
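A small sketch illustrating that behavior, assuming a CUDA device is available: under autocast, the parameters stay float32, while the output of an eligible op comes back in float16.

```python
# Under autocast, parameters keep their dtype; only the op's computation
# (and hence its output) runs in the lower-precision dtype.
import torch

layer = torch.nn.Linear(4, 4).cuda()
x = torch.randn(2, 4, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = layer(x)

print(layer.weight.dtype)  # torch.float32 -- parameters untouched
print(y.dtype)             # torch.float16 -- matmul ran in half precision
```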
Follow-up to #1322. Rename `torch_amp_options` to `torch_amp`. Allow simply `torch_amp = 'float16'` in the config. Allow specifying `grad_scaler` separately.
I renamed `torch_amp_options` to `torch_amp`. I was even thinking about shortening the name further.
I see no indication why, unless you really want your whole network to run in float16.
Because that further reduces the memory requirement? Why would you not want that? What are the downsides? I'm not saying that everything should be float16; maybe certain ops must stay in float32. I thought this is the main aspect of autocast. I just don't understand why the weights are stored in float32 and then always auto-cast. That also adds some overhead in computation (the casting), and requires more memory. Unless there is maybe some reason. But that is my question: what is the reason for this?
Ah, I was just checking the original paper introducing automatic mixed precision training, and it explains it (Sec. 3.1, on the FP32 master copy of weights): the weight updates (gradient times learning rate) can become too small to be represented in FP16, so an FP32 master copy of the weights is kept and updated in the optimizer step; otherwise those small updates would be lost.
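A tiny numeric illustration of that argument (the values are just examples): a small weight update is rounded away in float16 but preserved in float32.

```python
# Why an FP32 master copy of the weights helps: a small update is lost
# entirely in float16 (spacing near 1.0 is ~9.8e-4) but kept in float32.
import torch

w16 = torch.tensor(1.0, dtype=torch.float16)
w32 = torch.tensor(1.0, dtype=torch.float32)
update = 1e-4  # e.g. lr * grad

print(w16 + update)  # tensor(1., dtype=torch.float16) -- update rounded away
print(w32 + update)  # tensor(1.0001) -- update preserved
```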
Now that the code is longer, we might want to move this into the updater or add an extra module instead of having it plain in the engine.
Uses `torch_amp` as config dict with a `dtype` option. Adds GradScaler to the engine, and applies autocast and the scaler during training if AMP is enabled. Uses `grad_scaler` as config option to explicitly configure it. Fixes #1334.
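A hedged sketch of what a config using these options could look like; the exact accepted values, in particular for `grad_scaler`, are assumptions here and not taken from the RETURNN source.

```python
# Sketch of a RETURNN config snippet for AMP training (assumed value formats).

# Shorthand form: just name the autocast dtype.
torch_amp = "float16"

# Or the dict form with an explicit "dtype" entry:
# torch_amp = {"dtype": "bfloat16"}

# Separate option to configure the GradScaler explicitly
# (assumed here to take GradScaler keyword arguments):
# grad_scaler = {"init_scale": 2.0 ** 16}
```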